We, Joel Addala, Chris Chhotai, and Shreyansh Patel, hereby state that we have not communicated with or gained information in any way from any person or resource that would violate the College’s academic integrity policies, and that all work presented is our own. In addition, we also agree not to share our work in any way, before or after submission, that would violate the College’s academic integrity policies
R : R version 4.2.2 (2022-10-31 ucrt)
RStudio 2023.03.0+386 “Cherry Blossom” Release (3c53477afb13ab959aeb5b34df1f10c237b256c3, 2023-03-09) for Windows Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) RStudio/2023.03.0+386 Chrome/108.0.5359.179 Electron/22.0.3 Safari/537.36
library(ggplot2)
library(ggthemes)
library(dplyr)
library(plotly)
library(caret)
library(knitr)
library(kableExtra)
library(naniar)
library(modeest)
library(patchwork)
library(plotly)
The Data was collected by an annual survey conducted by the CDC (USA), called the Behavioral Risk Factor Surveillance System (BRFSS). It consists of questions related to multiple factors such as Age, Lifestyle Habits, Various Medical Conditions, Gender, Financial Situations, and Education. This is done to figure out the tests and preventive measures needed for preventing heart diseases.
data = read.csv("Heart Disease Health Indicator.csv")
# Table View of data
kbl(data[1:6,]) %>%
kable_paper() %>%
scroll_box(width = "100%")
| HeartDiseaseorAttack | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | Diabetes | PhysActivity | Fruits | Veggies | HvyAlcoholConsump | AnyHealthcare | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 40 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 5 | 18 | 15 | 1 | 0 | 9 | 4 | 3 |
| 0 | 0 | 0 | 0 | 25 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 3 | 0 | 0 | 0 | 0 | 7 | 6 | 1 |
| 0 | 1 | 1 | 1 | 28 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 5 | 30 | 30 | 1 | 0 | 9 | 4 | 8 |
| 0 | 1 | 0 | 1 | 27 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 11 | 3 | 6 |
| 0 | 1 | 1 | 1 | 24 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 2 | 3 | 0 | 0 | 0 | 11 | 5 | 4 |
| 0 | 1 | 1 | 1 | 25 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 2 | 0 | 2 | 0 | 1 | 10 | 6 | 8 |
str(data)
## 'data.frame': 253680 obs. of 22 variables:
## $ HeartDiseaseorAttack: num 0 0 0 0 0 0 0 0 1 0 ...
## $ HighBP : num 1 0 1 1 1 1 1 1 1 0 ...
## $ HighChol : num 1 0 1 0 1 1 0 1 1 0 ...
## $ CholCheck : num 1 0 1 1 1 1 1 1 1 1 ...
## $ BMI : num 40 25 28 27 24 25 30 25 30 24 ...
## $ Smoker : num 1 1 0 0 0 1 1 1 1 0 ...
## $ Stroke : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Diabetes : num 0 0 0 0 0 0 0 0 2 0 ...
## $ PhysActivity : num 0 1 0 1 1 1 0 1 0 0 ...
## $ Fruits : num 0 0 1 1 1 1 0 0 1 0 ...
## $ Veggies : num 1 0 0 1 1 1 0 1 1 1 ...
## $ HvyAlcoholConsump : num 0 0 0 0 0 0 0 0 0 0 ...
## $ AnyHealthcare : num 1 0 1 1 1 1 1 1 1 1 ...
## $ NoDocbcCost : num 0 1 1 0 0 0 0 0 0 0 ...
## $ GenHlth : num 5 3 5 2 2 2 3 3 5 2 ...
## $ MentHlth : num 18 0 30 0 3 0 0 0 30 0 ...
## $ PhysHlth : num 15 0 30 0 0 2 14 0 30 0 ...
## $ DiffWalk : num 1 0 1 0 0 0 0 1 1 0 ...
## $ Sex : num 0 0 0 0 0 1 0 0 0 1 ...
## $ Age : num 9 7 9 11 11 10 9 11 9 8 ...
## $ Education : num 4 6 4 3 5 6 6 4 5 4 ...
## $ Income : num 3 1 8 6 4 8 7 4 1 3 ...
n_miss(data)
## [1] 0
any_na(data)
## [1] FALSE
miss_var_summary(data)
## # A tibble: 22 × 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 HeartDiseaseorAttack 0 0
## 2 HighBP 0 0
## 3 HighChol 0 0
## 4 CholCheck 0 0
## 5 BMI 0 0
## 6 Smoker 0 0
## 7 Stroke 0 0
## 8 Diabetes 0 0
## 9 PhysActivity 0 0
## 10 Fruits 0 0
## # … with 12 more rows
The Data has 253,680 records and 22 Variables, with no null or NA values.
table(data$HeartDiseaseorAttack)
##
## 0 1
## 229787 23893
prop.table(table(data$HeartDiseaseorAttack))
##
## 0 1
## 0.90581441 0.09418559
It is observed that the data is severely imbalanced. To solve this issue the “downSample()” function from the caret package is used. The original data set is down sampled because there is too much data in the majority class.
Down Sampling works by reducing the Majority class to an equal size of the Minority class.
dataset <- downSample(x = data[, -ncol(data)], y=factor(data$HeartDiseaseorAttack))
str(dataset)
## 'data.frame': 47786 obs. of 22 variables:
## $ HeartDiseaseorAttack: num 0 0 0 0 0 0 0 0 0 0 ...
## $ HighBP : num 0 0 0 0 1 0 0 1 0 1 ...
## $ HighChol : num 0 1 0 0 1 0 0 0 0 0 ...
## $ CholCheck : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BMI : num 26 21 18 20 31 23 31 26 39 25 ...
## $ Smoker : num 0 0 1 0 1 1 0 0 0 1 ...
## $ Stroke : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Diabetes : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PhysActivity : num 1 1 1 1 0 1 1 1 0 1 ...
## $ Fruits : num 1 0 1 0 0 0 0 1 1 1 ...
## $ Veggies : num 1 1 1 1 0 1 0 1 1 1 ...
## $ HvyAlcoholConsump : num 0 0 0 0 0 0 0 0 0 0 ...
## $ AnyHealthcare : num 1 1 1 1 1 1 1 1 1 1 ...
## $ NoDocbcCost : num 0 0 0 0 0 0 0 0 0 0 ...
## $ GenHlth : num 2 2 1 2 2 2 1 1 2 3 ...
## $ MentHlth : num 0 0 0 2 0 0 1 1 0 0 ...
## $ PhysHlth : num 0 2 0 0 4 0 2 0 0 0 ...
## $ DiffWalk : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Sex : num 1 0 0 0 1 1 1 1 0 0 ...
## $ Age : num 13 11 12 6 9 13 8 9 2 11 ...
## $ Education : num 6 4 5 6 5 6 5 6 6 6 ...
## $ Class : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
table(dataset$HeartDiseaseorAttack)
##
## 0 1
## 23893 23893
It is observed that the total number of records have been reduced from 253,680 to only about 47,786. The ratio of the classes have also become 1:1.
BMI is one of the numeric variables in the data set. To understand the potential values of BMI in the data set, the frequency table is created.
kbl(table(dataset$BMI)) %>%
kable_paper() %>%
scroll_box(height = "300px", width = "200px")
| Var1 | Freq |
|---|---|
| 12 | 2 |
| 13 | 4 |
| 14 | 12 |
| 15 | 30 |
| 16 | 77 |
| 17 | 179 |
| 18 | 302 |
| 19 | 693 |
| 20 | 976 |
| 21 | 1510 |
| 22 | 2263 |
| 23 | 2530 |
| 24 | 3424 |
| 25 | 3193 |
| 26 | 3731 |
| 27 | 4501 |
| 28 | 3211 |
| 29 | 2873 |
| 30 | 2941 |
| 31 | 2497 |
| 32 | 2182 |
| 33 | 1856 |
| 34 | 1520 |
| 35 | 1142 |
| 36 | 1004 |
| 37 | 878 |
| 38 | 709 |
| 39 | 633 |
| 40 | 467 |
| 41 | 383 |
| 42 | 331 |
| 43 | 350 |
| 44 | 224 |
| 45 | 182 |
| 46 | 138 |
| 47 | 121 |
| 48 | 113 |
| 49 | 95 |
| 50 | 70 |
| 51 | 47 |
| 52 | 55 |
| 53 | 58 |
| 54 | 25 |
| 55 | 38 |
| 56 | 22 |
| 57 | 19 |
| 58 | 13 |
| 59 | 8 |
| 60 | 14 |
| 61 | 6 |
| 62 | 6 |
| 63 | 8 |
| 64 | 6 |
| 65 | 2 |
| 66 | 6 |
| 67 | 5 |
| 68 | 3 |
| 69 | 1 |
| 70 | 2 |
| 71 | 9 |
| 72 | 2 |
| 73 | 8 |
| 75 | 5 |
| 76 | 1 |
| 77 | 8 |
| 79 | 14 |
| 80 | 1 |
| 81 | 10 |
| 82 | 6 |
| 84 | 8 |
| 85 | 1 |
| 87 | 5 |
| 88 | 2 |
| 89 | 2 |
| 90 | 1 |
| 92 | 7 |
| 95 | 2 |
| 98 | 3 |
ggplot(dataset, aes(BMI)) +
geom_histogram(fill = "#8bbf7e") +
labs(x = "BMI",
y = "Count",
title = "Distribution of BMI",
caption = "Source: BRFSS 2015 | Kaggle"
) +
theme_hc() +
theme(
axis.title = element_text()
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
BMI_q1 = quantile(dataset$BMI, 0.25)
BMI_q2 = quantile(dataset$BMI, 0.75)
BMI_iqr = IQR(dataset$BMI)
bmi_lower_limit = BMI_q1 - (1.5 * BMI_iqr)
bmi_upper_limit = BMI_q2 + (1.5 * BMI_iqr)
print(bmi_lower_limit)
## 25%
## 12
print(bmi_upper_limit)
## 75%
## 44
clean_bmi_ds <- dataset %>% filter(BMI >= bmi_lower_limit & BMI <= bmi_upper_limit)
kbl(table(clean_bmi_ds$BMI)) %>%
kable_paper() %>%
scroll_box(height = "300px", width = "200px")
| Var1 | Freq |
|---|---|
| 12 | 2 |
| 13 | 4 |
| 14 | 12 |
| 15 | 30 |
| 16 | 77 |
| 17 | 179 |
| 18 | 302 |
| 19 | 693 |
| 20 | 976 |
| 21 | 1510 |
| 22 | 2263 |
| 23 | 2530 |
| 24 | 3424 |
| 25 | 3193 |
| 26 | 3731 |
| 27 | 4501 |
| 28 | 3211 |
| 29 | 2873 |
| 30 | 2941 |
| 31 | 2497 |
| 32 | 2182 |
| 33 | 1856 |
| 34 | 1520 |
| 35 | 1142 |
| 36 | 1004 |
| 37 | 878 |
| 38 | 709 |
| 39 | 633 |
| 40 | 467 |
| 41 | 383 |
| 42 | 331 |
| 43 | 350 |
| 44 | 224 |
ggplot(clean_bmi_ds, aes(BMI)) +
geom_histogram(fill = "#8bbf7e") +
labs(x = "BMI",
y = "Count",
title = "Distribution of BMI",
caption = "Source: BRFSS 2015 | Kaggle"
) +
theme_hc() +
theme(
axis.title = element_text()
)
summary(clean_bmi_ds$BMI)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 24.00 27.00 28.27 32.00 44.00
mlv(clean_bmi_ds$BMI, method = "mfv")
## [1] 27
Q4. Since the data is normally distributed, no transformation is required.
Q5. The appropriate measure of central tendency is already calculated as above.
*Q6. The Visualization shows that the BMI variable is a normal distribution. To verify, the Mean, Median, and Mode is calculated. With those three values it is clear that the center of the curve is at the center of the distribution.
In this plot, the PhysHlth variable is mapped to identify how the data is distributed. This variable was directed as a question asking “Out of the past 30 days, how many were you physically sick?”. Therefore the data is a continuous range of 0-30 days.
kbl(table(clean_bmi_ds$PhysHlth)) %>%
kable_paper() %>%
scroll_box(height = "300px", width = "200px")
| Var1 | Freq |
|---|---|
| 0 | 25819 |
| 1 | 1669 |
| 2 | 2526 |
| 3 | 1581 |
| 4 | 921 |
| 5 | 1562 |
| 6 | 340 |
| 7 | 931 |
| 8 | 184 |
| 9 | 37 |
| 10 | 1349 |
| 11 | 19 |
| 12 | 130 |
| 13 | 18 |
| 14 | 542 |
| 15 | 1310 |
| 16 | 25 |
| 17 | 22 |
| 18 | 40 |
| 19 | 3 |
| 20 | 875 |
| 21 | 152 |
| 22 | 18 |
| 23 | 18 |
| 24 | 18 |
| 25 | 381 |
| 26 | 23 |
| 27 | 30 |
| 28 | 155 |
| 29 | 77 |
| 30 | 5853 |
Q1. All values are within 30, representing the response out of 30 days, therefore there are no possibilities of outliers in this variable.
Q2. Since there are no outliers for this variable there is no need to clean the data and handle them.
library(ggplot2)
ggplot(clean_bmi_ds, aes(x = PhysHlth)) +
geom_density(fill = "#1C86EE", alpha = 0.8) +
labs(x = "Physical Health",
y = "Density",
title = "Distribution of Physical Health",
caption = "Source: BRFSS 2015 | Kaggle"
) +
theme_minimal() +
theme(
axis.title = element_text(size = 12),
axis.text = element_text(size = 10),
plot.caption = element_text(hjust = 0)
)
It is observed from the plot itself that the variable is skewed to the Right. We can also check the Mean, median, and mode to cross verify the skewness.
summary(clean_bmi_ds$PhysHlth)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 6.276 7.000 30.000
mlv(clean_bmi_ds$PhysHlth, method = "mfv")
## [1] 0
Q4. It is observed that the Median and mode are 0 and the mean is 6.31. However, the Maximum value is 30 representing the rightward skewness of the plot. Therefore it is best to apply some transformation in order to make the data normally distributed.
To normalize the variable the Log of the variable can be taken into consideration as there is a long right tail which is plotted below.
library(ggplot2)
phys_health <- data.frame(PhysHlth = clean_bmi_ds$PhysHlth)
ggplot(phys_health, aes(x = log10(PhysHlth + 1))) +
geom_density(fill = "#1C86EE", alpha = 0.8) +
labs(x = "Physical Health",
y = "Density",
title = "Distribution of Physical Health",
caption = "Source: BRFSS 2015 | Kaggle"
) +
theme_minimal() +
theme(
axis.title = element_text(size = 12),
axis.text = element_text(size = 10),
plot.caption = element_text(hjust = 0)
)
summary(log10(clean_bmi_ds$PhysHlth + 1))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4482 0.9031 1.4914
In this plot, the MentHlth variable is mapped to identify how the data is distributed. This variable was directed as a question asking “Out of the past 30 days, how many were you mentally sick?”. Therefore the data is a continuous range of 0-30 days.
kbl(table(clean_bmi_ds$MentHlth)) %>%
kable_paper() %>%
scroll_box(height = "300px", width = "200px")
| Var1 | Freq |
|---|---|
| 0 | 31863 |
| 1 | 1283 |
| 2 | 2144 |
| 3 | 1280 |
| 4 | 673 |
| 5 | 1600 |
| 6 | 203 |
| 7 | 565 |
| 8 | 113 |
| 9 | 10 |
| 10 | 1269 |
| 11 | 7 |
| 12 | 72 |
| 13 | 10 |
| 14 | 225 |
| 15 | 1160 |
| 16 | 20 |
| 17 | 12 |
| 18 | 21 |
| 19 | 5 |
| 20 | 739 |
| 21 | 64 |
| 22 | 13 |
| 23 | 6 |
| 24 | 7 |
| 25 | 265 |
| 26 | 6 |
| 27 | 19 |
| 28 | 62 |
| 29 | 39 |
| 30 | 2873 |
Q1. Since, all values are within 30, representing the response out of 30 days, therefore there are no possibilities of outliers in this variable.
Q2. Since there are no outliers for this variable there is no need to clean the data and handle them.
ggplot(clean_bmi_ds, aes(x = MentHlth)) +
geom_density(stat = "density", color = "#EEEE00", fill = "#EEEE00", alpha = 0.3, bw = 2) +
geom_vline(aes(xintercept = median(MentHlth)), color = "red", linetype = "dashed", size = 1) +
labs(x = "Mental Health",
y = "Density",
title = "Distribution of Mental Health",
subtitle = paste("Median =", median(clean_bmi_ds$MentHlth)),
caption = "Source: BRFSS 2015 | Kaggle"
) +
theme_minimal() +
theme(
axis.title = element_text(size = 12),
axis.text = element_text(size = 10),
plot.caption = element_text(hjust = 0),
plot.subtitle = element_text(size = 10, hjust = 0.5)
)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
The variable is shown to be skewed to the right from the plot itself. To further confirm the skewness, we can additionally cross-check the Mean, Median, and Mode.
summary(clean_bmi_ds$MentHlth)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 3.745 2.000 30.000
mlv(clean_bmi_ds$MentHlth, method = "mfv")
## [1] 0
Q4. It has been noted that the mean is 3.756 and that the median and mode are also 0. The plot’s rightward skewness is represented by the Highest value of 30, which is nevertheless. Consequently, in order to make the data properly distributed, it is best to perform some change.
To normalize the variable the Log of the variable can be taken into consideration as there is a long right tail.
library(ggplot2)
ment_health <- data.frame(MentHlth = clean_bmi_ds$MentHlth)
ggplot(ment_health, aes(x = log10(clean_bmi_ds$MentHlth + 1))) +
geom_density(stat = "density", color = "#EEEE00", fill = "#EEEE00", alpha = 0.3, bw = 2) +
geom_vline(aes(xintercept = median(MentHlth)), color = "red", linetype = "dashed", size = 1) +
labs(x = "Mental Health",
y = "Density",
title = "Distribution of Mental Health",
subtitle = paste("Median =", median(clean_bmi_ds$MentHlth)),
caption = "Source: BRFSS 2015 | Kaggle"
) +
theme_minimal() +
theme(
axis.title = element_text(size = 12),
axis.text = element_text(size = 10),
plot.caption = element_text(hjust = 0),
plot.subtitle = element_text(size = 10, hjust = 0.5)
)
summary(log10(clean_bmi_ds$MentHlth + 1))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2958 0.4771 1.4914
According to the Data description provided along with the data set, it is understood that the age is a factor of 13 levels with each level representing the age as a collection of 5 years. As such, a bar plot is used to identify the distribution of the categorical variable.
ggplot(clean_bmi_ds, aes(x = factor(Age))) +
geom_bar( fill = "#63B8FF") +
scale_x_discrete(labels = c("1" = "Group 1", "2" = "Group 2", "3" = "Group 3", "4" = "Group 4",
"5" = "Group 5", "6" = "Group 6", "7" = "Group 7", "8" = "Group 8",
"9" = "Group 9", "10" = "Group 10", "11" = "Group 11",
"12" = "Group 12", "13" = "Group 13")) +
labs(x = "Age",
y = "Count",
title = "Distribution of Age Groups",
subtitle = "As Count",
caption = "Source: BRFSS 2015 | Kaggle",
) +
theme_hc() +
theme(
axis.title = element_text(),
axis.text.x = element_text(angle = 90)
)
ggplot(clean_bmi_ds, aes(x = factor(Age))) +
geom_bar( aes(y = ..prop.., group = 1),stat = "count",position = 'stack',fill = "#63B8FF") +
scale_y_continuous(labels = scales::percent_format())+
scale_x_discrete(labels = c("1" = "Group 1", "2" = "Group 2", "3" = "Group 3", "4" = "Group 4",
"5" = "Group 5", "6" = "Group 6", "7" = "Group 7", "8" = "Group 8",
"9" = "Group 9", "10" = "Group 10", "11" = "Group 11",
"12" = "Group 12", "13" = "Group 13")) +
labs(x = "Age",
y = "Proportions",
title = "Distribution of Age Groups",
subtitle = "As Proportions",
caption = "Source: BRFSS 2015 | Kaggle",
) +
theme_hc() +
theme(
axis.title = element_text(),
axis.text.x = element_text(angle = 90)
)
## Warning: The dot-dot notation (`..prop..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(prop)` instead.
unique(clean_bmi_ds$Age)
## [1] 13 11 12 6 9 8 2 5 4 10 7 3 1
Q1. This code produces a bar plot where the x-axis represents the different Age groups and the y-axis represents the count of observations in each group. The labels of the x-axis have been adjusted to show the actual Age groups.
Q2. The code below that produces a bar plot where the x-axis represents the different Age groups and the y-axis represents the distribution of proportions in each group. The labels of the x-axis have been adjusted to show the actual Age groups.
Q3. After observing the plot, it is clear that the higher number of records are in the older age groups, with majority of the data belonging to groups 9 to 11.
Q4. The number of unique values are not too many as all of the data fit into 13 categories. That said, it can be more efficient to have lesser number of factors in the variable.
According to the Data description provided along with the data set, it is understood that the age is a factor of 13 levels with each level representing the age as a collection of 5 years. As such, a bar plot is used to identify the distribution of the categorical variable.
ggplot(clean_bmi_ds, aes(x = factor(Education))) +
geom_bar(fill = "#FF82AB") +
scale_x_discrete(labels = c("1" = "Group 1", "2" = "Group 2", "3" = "Group 3", "4" = "Group 4",
"5" = "Group 5", "6" = "Group 6")) +
labs(x = "Education",
y = "Count",
title = "Distribution of Education Groups",
subtitle = "As Count",
caption = "Source: BRFSS 2015 | Kaggle",
) +
theme_hc() +
theme(
axis.title = element_text(),
axis.text.x = element_text(angle = 90)
)
ggplot(clean_bmi_ds, aes(x = factor(Education))) +
geom_bar(aes(y = ..prop.., group = 1),stat = "count",position = 'stack', fill = "#FF82AB") +
scale_y_continuous(labels = scales::percent_format())+
scale_x_discrete(labels = c("1" = "Group 1", "2" = "Group 2", "3" = "Group 3", "4" = "Group 4",
"5" = "Group 5", "6" = "Group 6")) +
labs(x = "Education",
y = "Proportions",
title = "Distribution of Education Groups",
subtitle = "As Proportions",
caption = "Source: BRFSS 2015 | Kaggle",
) +
theme_hc() +
theme(
axis.title = element_text(),
axis.text.x = element_text(angle = 90)
)
unique(clean_bmi_ds$Education)
## [1] 6 4 5 2 3 1
Q1. This code produces a bar plot where the x-axis represents the Education groups and the y-axis represents the count of observations in each group. The labels of the x-axis have been adjusted to show the actual Education groups.
Q2. The code below that produces a bar plot where the x-axis represents the Education groups and the y-axis represents the distribution of proportions in each group. The labels of the x-axis have been adjusted to show the actual Education groups.
Q3. The plot shows that the majority of individuals fall under education groups 4, 5, and 6, whereas education group 1 has the lowest count of individuals. Additionally, the plot does not show any extreme outliers or unexpected patterns, and the distribution seems relatively balanced.
Q4. This plot has six unique values for the Education variable in the given plot which seems appropriate and makes the plot easy to interpret. It allows for clear differentiation between the education groups and avoids clutter in the plot while still providing enough detail to understand the distribution of education levels in the dataset.
ggplot(clean_bmi_ds, aes(x = factor(GenHlth))) +
geom_bar(fill = "#EE9572") +
scale_x_discrete(labels = c("1" = "Group 1", "2" = "Group 2", "3" = "Group 3", "4" = "Group 4",
"5" = "Group 5")) +
labs(x = "General Health",
y = "Count",
title = "Distribution of General Health",
subtitle = "As Count",
caption = "Source: BRFSS 2015 | Kaggle",
) +
theme_hc() +
theme(
axis.title = element_text(),
axis.text.x = element_text(angle = 90)
)
ggplot(clean_bmi_ds, aes(x = factor(GenHlth))) +
geom_bar(aes(y = ..prop.., group = 1),stat = "count",position = 'stack', fill = "#EE9572") +
scale_y_continuous(labels = scales::percent_format())+
scale_x_discrete(labels = c("1" = "Group 1", "2" = "Group 2", "3" = "Group 3", "4" = "Group 4",
"5" = "Group 5", "6" = "Group 6")) +
labs(x = "General Health",
y = "Proportions",
title = "Distribution of General Health",
subtitle = "As Proportions",
caption = "Source: BRFSS 2015 | Kaggle",
) +
theme_hc() +
theme(
axis.title = element_text(),
axis.text.x = element_text(angle = 90)
)
unique(clean_bmi_ds$GenHlth)
## [1] 2 1 3 4 5
Q1. This code produces a bar plot where the x-axis represents the General Health and the y-axis represents the count of observations in each group. The labels of the x-axis have been adjusted to show the actual Education groups.
Q2. The code below that produces a bar plot where the x-axis represents the Education groups and the y-axis represents the distribution of proportions in each group. The labels of the x-axis have been adjusted to show the actual Education groups.
Q3. There are no unusual observations in the plot, and the distribution of general health appears to be relatively balanced, with the highest number of participants in Group 3 (which corresponds to “good” general health) and the lowest number of participants in Group 1 (which corresponds to “poor” general health).
Q4. The variable “General Health” has five distinct values. In general, it is preferable to keep the number of distinct values on a categorical variable to a manageable number, as having too many categories can cause the plot to become cluttered and difficult to interpret. The number of distinct values in this plot appears to be appropriate, striking a good balance between providing sufficient information and maintaining clarity.
The BMI is plotted against the continuous variable PhysHlth. These values are the responses where participants stated their number of physically sick days out of the past 30 days.
kbl(table(clean_bmi_ds$PhysHlth)) %>%
kable_paper() %>%
scroll_box(height = "300px", width = "200px")
| Var1 | Freq |
|---|---|
| 0 | 25819 |
| 1 | 1669 |
| 2 | 2526 |
| 3 | 1581 |
| 4 | 921 |
| 5 | 1562 |
| 6 | 340 |
| 7 | 931 |
| 8 | 184 |
| 9 | 37 |
| 10 | 1349 |
| 11 | 19 |
| 12 | 130 |
| 13 | 18 |
| 14 | 542 |
| 15 | 1310 |
| 16 | 25 |
| 17 | 22 |
| 18 | 40 |
| 19 | 3 |
| 20 | 875 |
| 21 | 152 |
| 22 | 18 |
| 23 | 18 |
| 24 | 18 |
| 25 | 381 |
| 26 | 23 |
| 27 | 30 |
| 28 | 155 |
| 29 | 77 |
| 30 | 5853 |
All values are within 30, representing the response out of 30 days, therefore there are no possibilities of outliers in this variable.
ggplot(clean_bmi_ds, aes(BMI, PhysHlth)) +
geom_point() +
geom_smooth(method = "auto", se=FALSE) +
labs(
title = "BMI vs. Number of Sick Days Experienced",
subtitle = "",
x = "BMI",
y = "No. Of Sick Days (PhsyHlth)",
caption = "Source: BRFSS 2015 | Kaggle",
) +
theme_hc() +
theme(
axis.title = element_text()
)
Q1. A scatter plot depicts the relationship between two continuous variables, BMI and Physical Health. The alpha argument to the “geom_point()” function specifies how opaque the points are when there are many data points, preventing overplotting. Because of the theme bw() function, the plot has a black and white theme.
Q2. Physical health and BMI have a negative relationship. This means that as a person’s BMI increases, so does their physical health. The strength of the relationship appears to be moderate, as there is a noticeable downward trend in the data points, but there is still some variability in the data.
Q3. On observation, it is clear that for smaller BMI, the No of sick days decreases with increase in BMI. However, the reverse happens after a certain point (BMI = 22) resulting with the PhysHlth Variable increasing with BMI. The shape appears to be an inverted bell curve.
Q4. It can be inferred that at optimal BMI (20-24) the No. of sick days is the least, while lower or higher BMIs lead to higher No of sick days.
The BMI is plotted against the continuous variable MentHlth. These values are the responses where participants stated their number of Mentally sick days out of the past 30 days.
kbl(table(clean_bmi_ds$MentHlth)) %>%
kable_paper() %>%
scroll_box(height = "300px", width = "200px")
| Var1 | Freq |
|---|---|
| 0 | 31863 |
| 1 | 1283 |
| 2 | 2144 |
| 3 | 1280 |
| 4 | 673 |
| 5 | 1600 |
| 6 | 203 |
| 7 | 565 |
| 8 | 113 |
| 9 | 10 |
| 10 | 1269 |
| 11 | 7 |
| 12 | 72 |
| 13 | 10 |
| 14 | 225 |
| 15 | 1160 |
| 16 | 20 |
| 17 | 12 |
| 18 | 21 |
| 19 | 5 |
| 20 | 739 |
| 21 | 64 |
| 22 | 13 |
| 23 | 6 |
| 24 | 7 |
| 25 | 265 |
| 26 | 6 |
| 27 | 19 |
| 28 | 62 |
| 29 | 39 |
| 30 | 2873 |
All values are within 30, representing the response out of 30 days, therefore there are no possibilities of outliers in this variable.
ggplot(clean_bmi_ds, aes(BMI, MentHlth)) +
geom_point() +
geom_smooth(method = "auto", se=FALSE) +
labs(
title = "BMI vs. Number of Stressed Days Experienced",
subtitle = "",
x = "BMI",
y = "No. Of Stressed Days (MentHlth)",
caption = "Source: BRFSS 2015 | Kaggle",
) +
theme_hc() +
theme(
axis.title = element_text()
)
Q1. Mental Health vs. BMI A scatter plot with points representing individual observations and a smoothed line indicating the overall trend.
Q2. On observation, it is clear that for smaller BMI, the No of sick days decreases with increase in BMI. However, the reverse happens after a certain point (BMI = 24) resulting with the MentHlth Variable increasing with BMI. The shape appears to be an inverted bell curve.
Q3. The number of mentally ill days seem to be generally lower than the number of physically sick days. However, it follows the same pattern as PhysHlth when compared to BMI.
Q4. This means that for optimal BMI, mental sickness is relatively lesser than those which higher BMI.
The Physical Health is plotted against the continuous variable Meantal Health. These values are the responses where participants stated their number of physically sick days and meantally sick out of the past 30 days.
kbl(table(clean_bmi_ds$PhysHlth,clean_bmi_ds$MentHlth)) %>%
kable_paper() %>%
scroll_box(height = "300px", width = "200px")
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20780 | 632 | 1019 | 593 | 293 | 598 | 71 | 205 | 35 | 4 | 405 | 4 | 12 | 3 | 65 | 283 | 6 | 1 | 4 | 0 | 135 | 12 | 1 | 1 | 2 | 42 | 4 | 3 | 9 | 11 | 586 |
| 1 | 1039 | 159 | 117 | 75 | 36 | 65 | 5 | 27 | 5 | 0 | 37 | 0 | 4 | 1 | 7 | 36 | 0 | 1 | 0 | 0 | 17 | 2 | 0 | 0 | 0 | 4 | 0 | 0 | 1 | 0 | 31 |
| 2 | 1568 | 158 | 193 | 98 | 61 | 121 | 20 | 44 | 10 | 0 | 65 | 0 | 5 | 0 | 11 | 43 | 1 | 0 | 2 | 0 | 29 | 1 | 0 | 0 | 0 | 12 | 0 | 1 | 3 | 2 | 78 |
| 3 | 941 | 77 | 109 | 99 | 38 | 67 | 9 | 30 | 10 | 1 | 37 | 0 | 4 | 0 | 9 | 39 | 0 | 3 | 0 | 1 | 34 | 1 | 1 | 0 | 0 | 3 | 0 | 2 | 1 | 1 | 64 |
| 4 | 552 | 28 | 62 | 37 | 46 | 25 | 9 | 14 | 4 | 0 | 26 | 1 | 3 | 1 | 9 | 29 | 1 | 0 | 1 | 0 | 16 | 5 | 1 | 0 | 0 | 8 | 0 | 0 | 4 | 0 | 39 |
| 5 | 870 | 49 | 102 | 53 | 27 | 139 | 6 | 27 | 6 | 0 | 86 | 0 | 3 | 0 | 14 | 38 | 1 | 1 | 1 | 0 | 37 | 1 | 1 | 0 | 1 | 11 | 0 | 0 | 1 | 0 | 87 |
| 6 | 193 | 4 | 28 | 15 | 7 | 10 | 20 | 7 | 4 | 1 | 9 | 0 | 1 | 0 | 1 | 6 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 26 |
| 7 | 556 | 25 | 57 | 37 | 10 | 36 | 3 | 60 | 2 | 1 | 18 | 0 | 4 | 0 | 11 | 21 | 1 | 0 | 0 | 0 | 14 | 6 | 1 | 0 | 0 | 6 | 0 | 1 | 3 | 0 | 58 |
| 8 | 92 | 8 | 9 | 9 | 7 | 3 | 3 | 2 | 4 | 0 | 9 | 0 | 1 | 0 | 1 | 13 | 1 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 2 | 1 | 12 |
| 9 | 18 | 1 | 3 | 4 | 1 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 2 |
| 10 | 697 | 29 | 92 | 42 | 20 | 92 | 5 | 15 | 7 | 0 | 117 | 0 | 4 | 0 | 2 | 62 | 1 | 0 | 0 | 0 | 48 | 0 | 1 | 0 | 0 | 15 | 0 | 0 | 2 | 0 | 98 |
| 11 | 10 | 1 | 0 | 0 | 1 | 3 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 12 | 57 | 2 | 8 | 4 | 5 | 7 | 4 | 1 | 2 | 0 | 5 | 0 | 7 | 0 | 0 | 4 | 0 | 0 | 1 | 0 | 5 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 2 | 1 | 13 |
| 13 | 2 | 1 | 1 | 1 | 1 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 14 | 316 | 10 | 24 | 14 | 12 | 25 | 3 | 15 | 2 | 0 | 10 | 0 | 0 | 0 | 31 | 7 | 1 | 0 | 1 | 0 | 14 | 6 | 0 | 0 | 0 | 3 | 0 | 1 | 0 | 1 | 46 |
| 15 | 602 | 23 | 54 | 37 | 15 | 101 | 7 | 22 | 8 | 0 | 78 | 0 | 3 | 0 | 3 | 139 | 0 | 2 | 1 | 0 | 67 | 3 | 1 | 1 | 0 | 18 | 0 | 0 | 2 | 1 | 122 |
| 16 | 10 | 0 | 3 | 2 | 1 | 3 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 17 | 7 | 0 | 0 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 18 | 12 | 0 | 0 | 0 | 2 | 2 | 1 | 1 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 7 |
| 19 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 20 | 350 | 13 | 58 | 15 | 13 | 54 | 0 | 2 | 5 | 0 | 74 | 1 | 1 | 1 | 5 | 63 | 0 | 0 | 0 | 1 | 85 | 0 | 1 | 1 | 0 | 25 | 0 | 1 | 4 | 3 | 99 |
| 21 | 88 | 0 | 3 | 4 | 1 | 9 | 1 | 5 | 0 | 0 | 3 | 0 | 1 | 0 | 5 | 3 | 0 | 0 | 0 | 0 | 4 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13 |
| 22 | 4 | 1 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 3 |
| 23 | 8 | 0 | 2 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 24 | 5 | 0 | 2 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 3 |
| 25 | 129 | 8 | 8 | 11 | 3 | 26 | 4 | 4 | 0 | 1 | 32 | 0 | 3 | 0 | 1 | 27 | 0 | 0 | 1 | 2 | 23 | 1 | 1 | 1 | 0 | 32 | 0 | 1 | 0 | 1 | 61 |
| 26 | 5 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 3 | 0 | 2 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 2 |
| 27 | 12 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 3 | 0 | 2 | 0 | 1 | 0 |
| 28 | 46 | 2 | 6 | 7 | 2 | 5 | 3 | 3 | 0 | 0 | 8 | 0 | 1 | 1 | 1 | 13 | 2 | 1 | 1 | 0 | 5 | 1 | 0 | 0 | 1 | 5 | 1 | 0 | 7 | 1 | 32 |
| 29 | 31 | 3 | 2 | 2 | 1 | 4 | 0 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 2 | 6 | 11 |
| 30 | 2862 | 48 | 181 | 117 | 67 | 195 | 24 | 79 | 7 | 2 | 232 | 0 | 13 | 1 | 49 | 310 | 3 | 1 | 2 | 0 | 176 | 11 | 3 | 1 | 2 | 67 | 0 | 5 | 16 | 7 | 1372 |
quantiles <- quantile(clean_bmi_ds$MentHlth, c(0.025, 0.975))
ggplot(clean_bmi_ds, aes(x = PhysHlth, y = MentHlth)) +
geom_jitter() +
stat_smooth(method = "auto", se = FALSE) +
stat_smooth(aes(ymin = quantiles[1], ymax = quantiles[2]), alpha = 0.2, geom = "ribbon", fill = "blue") +
labs(
title = "Physical Health vs. Mental Health",
x = "Physical Health",
y = "Mental Health",
caption = "Source: BRFSS 2015 | Kaggle"
) +
theme_hc() +
theme(
axis.title = element_text()
)
Q1. The plot includes a scatterplot of the data points with Physical Health on the X-Axis and Mental Health on the Y-Axis and a smoothed line to show the overall trend in the data. In addition, a ribbon is added to show the 95% confidence interval of the smoothed line.
Q2. The scatter plot shows that as physical health improves, so does mental health. However, the relationship is not very strong because mental health scores vary greatly at each level of physical health. The relationship is positive, which means that as one variable increases, so does the other. Overall, the plot suggests that physical health is a poor predictor of mental health outcomes and that other factors
Q3. The scatterplot shows a moderate negative linear relationship between the two variables, suggesting that people with better physical health tend to have better mental health, and vice versa. The blue ribbon in the plot represents the 95% confidence interval of the mean mental health score for each level of physical health.There also appears to be a concentration of observations near the lower end of the physical health scale, indicating that a significant proportion of individuals in the dataset report poor physical health.
Q4. The plot shifts, especially at lower levels of physical health, where the data appears more scattered. This variation suggests that factors other than physical health may influence mental health. The strength of the correlation, as calculated in #2 above, indicates that the trend is statistically significant but not very strong.
Education variable is plotted against the BMI variable. The Education Variable is a categorical variable which has 6 levels, each indicating a higher form of education.
plot <- ggplot(clean_bmi_ds) +
geom_boxplot(aes(factor(Education), BMI), fill = "#8bbf7e") +
scale_x_discrete(label = c(
"1" = "Level 1", "2" = "Level 2", "3" = "Level 3",
"4" = "Level 4", "5" = "Level 5", "6" = "Level 6")) +
labs(
title = "Education Levels vs. BMI",
subtitle = "Distribution of BMI across different Education Levels",
caption = "Source: BRFSS 2015 | Kaggle",
x = "Education Category",
y = "BMI"
) +
theme_hc() +
theme(axis.title = element_text(),
axis.text.x = element_text(angle = 0))
ggplotly(plot) %>%
layout(yaxis = list(fixedrange = TRUE),
xaxis = list(fixedrange = TRUE)) %>%
config(displayModeBar = FALSE)
Q1. This scatter plot, with Education Category on the x-axis and BMI on the y-axis, the “labs” function adds a title, caption, and axis labels to the plot.
Q2. The form of the relationship appears to be non-linear.On Observation, it is noted that the distribution reduces along with the increase in education level.
Q3. This means on average, the individuals with higher education levels have lower BMI than those of lower education level.
Q4. In the boxplot, the variability in BMI within each education level can be observed by looking at the size of the boxes and the whiskers. If the boxes for each education level are large and the whiskers extend far from the box, then there is greater variability in BMI within that education level. On the other hand, if the boxes are small and the whiskers are short, then there is less variability in BMI within that education level.
Education variable is plotted against the BMI variable. The Education Variable is a categorical variable which has 6 levels, each indicating a higher form of education.
plot <- ggplot(clean_bmi_ds, aes(x = factor(Age), y = BMI)) +
geom_jitter(alpha = 0.5, color = "#8bbf7e", width = 0.2) +
stat_summary(fun.y = mean, geom = "point", shape = 20, size = 3, fill = "white", color = "black") +
stat_summary(fun.y = mean, geom = "errorbar", width = 0.2, linewidth = 1, color = "black") +
geom_hline(yintercept = mean(clean_bmi_ds$BMI), linetype = "dashed", color = "red") +
geom_text(aes(label = round(after_stat(y), 1)), stat = "summary", vjust = -0.5, size = 3) +
labs(
title = "Age vs. BMI",
subtitle = "Distribution of BMI across different Age Groups",
caption = "Source: BRFSS 2015 | Kaggle",
x = "Age",
y = "BMI"
) +
theme_classic() +
theme(axis.title = element_text(),
axis.text.x = element_text(angle = 0))
## Warning: The `fun.y` argument of `stat_summary()` is deprecated as of ggplot2 3.3.0.
## ℹ Please use the `fun` argument instead.
plot
## No summary function supplied, defaulting to `mean_se()`
Q1.The plot is a scatterplot with jittered points representing the BMI values for different age groups. The vertical axis shows BMI values, and the horizontal axis shows different age group
Q2. There appears to be a weak, positive relationship between age and BMI. As age increases, BMI values tend to increase slightly, although there is a significant amount of variability in BMI values within each age group. The shape of the relationship appears to be linear, with a slight positive slope.
Q3. The strength of the relationship is weak, as there is a large amount of overlap in BMI values between different age groups.The dots and error bars do not show a clear pattern of increasing or decreasing BMI values as age increases, which further supports the weak relationship observed in the plot.
Q4. In the Age vs. BMI plot, we observe a lot of variability in BMI values at each age level, with some ages having a wider spread of BMI values than others. This corresponds to the moderate strength of the positive relationship we calculated earlier, as there is a clear overall trend of increasing BMI with increasing age, but there is still a fair amount of variability in BMI within each age group.The dashed red line represents the overall mean BMI for the entire dataset, and we can see that the mean BMI for each age group tends to be close to the overall mean.
library(ggplot2)
ggplot(clean_bmi_ds) +
geom_boxplot(aes(x = factor(GenHlth), y = BMI), outlier.shape = NA, fill = "lightblue") +
scale_x_discrete(label = c(
"1" = "Group 1", "2" = "Group 2", "3" = "Group 3",
"4" = "Group 4", "5" = "Group 5")) +
labs(
title = "Two variables",
subtitle = "Distribution of BMI across different General Health Groups",
caption = "Source: BRFSS 2015 | Kaggle",
x = "General Health (GenHlth)",
y = "BMI"
)
Q1.The plot is a scatterplot with jittered points representing the BMI values for different General Health groups. The vertical axis shows BMI values, and the horizontal axis shows different General Health Group.
Q2. The plot shows a slight negative relationship between the two variables, with higher BMI values associated with lower levels of self-reported general health.The relationship appears to be weak, with a lot of variability in BMI values across all levels of general health.
Q3. The plot indicates that there is a positive relationship between general health status and BMI. In other words, individuals who report better general health tend to have lower BMI, and those who report poorer general health tend to have higher BMI.
Q4. In the plot, we observe a considerable amount of variability in BMI for each level of general health. Specifically, there is overlap in the range of BMI values between different levels of general health, indicating that BMI alone may not be a reliable indicator of general health status.We can see from the plot that there is a general trend of increasing BMI with poorer reported general health.